1. Question: Please refer to the table below to answer the following questions:
| Planned | Purchased | NotPurchased | Total |
|---|---|---|---|
| Yes | 400 | 100 | 500 |
| No | 200 | 1300 | 1500 |
| Total | 600 | 1400 | 2000 |
import pandas as pd
df=pd.DataFrame()
df["Planned"]=["Yes","No","Total"]
df["Purchased"]=[400,200,600]
df["NotPurchased"]=[100,1300,1400]
df["Total"]=[500,1500,2000]
df=df.set_index("Planned")
df
| | Purchased | NotPurchased | Total |
|---|---|---|---|
| Planned | | | |
| Yes | 400 | 100 | 500 |
| No | 200 | 1300 | 1500 |
| Total | 600 | 1400 | 2000 |
# People who planned and purchased
# event : Plan:YES & Purchase:Yes
event_result = df.loc["Yes","Purchased"]
# population : whole lot
population = df.loc["Total","Total"]
P_plan_n_purch = event_result / population
print("The joint probability of people who \n\
planned to purchase and actually placed an order \
is %.2f"%(P_plan_n_purch))
The joint probability of people who planned to purchase and actually placed an order is 0.20
# given the people who planned to purchase,
# probability of people who purchased
# event : Plan:YES & Purchase:Yes
# population : Planned:YES
# Using Conditional Probabilities
# Probability of people planned
P_plan = df.loc["Yes","Total"]/df.loc["Total","Total"]
# Intersection probability of people planned & purchased
P_plan_n_purch
# Conditional Probability
P_purch_given_plan = P_plan_n_purch / P_plan
print("The conditional probability of people who \n\
planned to purchase and actually placed an order \n\
given the set of people who planned to purchase is %.2f"%(P_purch_given_plan))
The conditional probability of people who planned to purchase and actually placed an order given the set of people who planned to purchase is 0.80
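The same answer can be cross-checked by restricting the population directly to the 500 people who planned. A minimal self-contained sketch that rebuilds the contingency table and computes the conditional probability both ways:

```python
import pandas as pd

# Rebuild the contingency table from the question
df = pd.DataFrame(
    {"Purchased": [400, 200, 600],
     "NotPurchased": [100, 1300, 1400],
     "Total": [500, 1500, 2000]},
    index=pd.Index(["Yes", "No", "Total"], name="Planned"),
)

# Joint probability: P(Plan = Yes AND Purchase = Yes)
p_joint = df.loc["Yes", "Purchased"] / df.loc["Total", "Total"]      # 400 / 2000

# Conditional probability, two equivalent ways:
# 1) by definition: P(Purchase | Plan) = P(Plan AND Purchase) / P(Plan)
p_plan = df.loc["Yes", "Total"] / df.loc["Total", "Total"]           # 500 / 2000
p_cond = p_joint / p_plan
# 2) by restricting the population to the 500 planners
p_cond_direct = df.loc["Yes", "Purchased"] / df.loc["Yes", "Total"]  # 400 / 500

print(round(p_joint, 2), round(p_cond, 2), round(p_cond_direct, 2))  # 0.2 0.8 0.8
```

Both routes agree, which confirms the 0.80 above.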
2. Question: An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
# historical failure rate
p = 0.05
# number of samples taken
n = 10
# since we are dealing with failure / success of product
# this follows a binomial distribution
# lets form a pmf & cdf for 10 samples
k=np.arange(0,11)
bin_pmf = stats.binom.pmf(k,n,p)
bin_cdf = stats.binom.cdf(k,n,p)
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.plot(k,bin_pmf,'o-')
plt.title("PMF")
plt.subplot(1,2,2)
plt.plot(k,bin_cdf,'o-')
plt.title("CDF");
A. Probability that none of the items are defective?
print("Probability that none of the items are defective \
is %.6f"%(bin_pmf[0]))
Probability that none of the items are defective is 0.598737
B. Probability that exactly one of the items is defective?
print("Probability that exactly one of the items is defective \
is %.6f"%(bin_pmf[1]))
Probability that exactly one of the items is defective is 0.315125
C. Probability that two or fewer of the items are defective?
print("Probability that two or fewer of the items are defective \
is %.6f"%(bin_cdf[2]))
Probability that two or fewer of the items are defective is 0.988496
bin_pmf[:3].sum() # cross-check: matches bin_cdf[2]
D. Probability that three or more of the items are defective?
print("Probability that three or more of the items are defective \
is %.6f"%(1-bin_cdf[2]))
Probability that three or more of the items are defective is 0.011504
bin_pmf[3:].sum() # cross-check: matches 1-bin_cdf[2]
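As a further cross-check on parts C and D, SciPy's survival function `binom.sf(k, n, p)` returns P(X > k) directly, avoiding the explicit `1 - cdf` complement. A short sketch with the same parameters:

```python
from scipy import stats

n, p = 10, 0.05  # 10 sampled items, 5% historical failure rate

p_none = stats.binom.pmf(0, n, p)  # P(X = 0): no defectives
p_le2 = stats.binom.cdf(2, n, p)   # P(X <= 2): two or fewer defectives
p_ge3 = stats.binom.sf(2, n, p)    # P(X > 2), i.e. three or more defectives

print(round(p_none, 6), round(p_le2, 6), round(p_ge3, 6))  # 0.598737 0.988496 0.011504
```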
3. Question: A car salesman sells on an average 3 cars per week.
# rate of sale, i.e. the mean
rate = 3
# since the number of trials is unknown and can tend to infinity,
# this could follow a Poisson distribution
# lets build a pmf for 10 sales
n = np.arange(0,11)
poi_pmf = stats.poisson.pmf(n,rate)
A. Probability that in a given week he will sell some cars.
print("Probability that in a given week he will sell some cars \
is %.6f"%(1-poi_pmf[0])) # selling anything above 0 cars
Probability that in a given week he will sell some cars is 0.950213
B. Probability that in a given week he will sell 2 or more but less than 5 cars.
print("Probability that in a given week \
he will sell 2 or more but less than 5 cars \
is %.6f"%(poi_pmf[2:5].sum())) # sum of all probabilities selling 2, 3, 4 cars
Probability that in a given week he will sell 2 or more but less than 5 cars is 0.616115
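Both answers can also be obtained without building the full PMF array, via SciPy's survival and cumulative distribution functions. A sketch with the same rate of 3 cars per week:

```python
from scipy import stats

rate = 3  # mean cars sold per week

# A. P(X >= 1): sells at least one car; sf(0) = P(X > 0) = 1 - pmf(0)
p_some = stats.poisson.sf(0, rate)
# B. P(2 <= X < 5) = P(X <= 4) - P(X <= 1)
p_2_to_4 = stats.poisson.cdf(4, rate) - stats.poisson.cdf(1, rate)

print(round(p_some, 6), round(p_2_to_4, 6))  # 0.950213 0.616115
```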
C. Plot the Poisson cumulative distribution function: cumulative probability of cars sold per week vs number of cars sold per week.
# lets build the cdf
poi_cdf = stats.poisson.cdf(n,rate)
plt.plot(n,poi_cdf,'o-')
plt.title("CDF")
plt.xlabel("number of cars sold per week")
plt.ylabel("cumulative probability of cars sold per week");
4. Question: Accuracy in understanding orders for a speech based bot at a restaurant is important for the Company X which has designed, marketed and launched the product for a contactless delivery due to the COVID-19 pandemic. Recognition accuracy that measures the percentage of orders that are taken correctly is 86.8%. Suppose that you place order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.
# we are concerned with the success & failure of 3 orders,
# hence this should follow a binomial distribution
# lets build a pmf for 3 trials
n = 3
p = 0.868
k = np.arange(0,4)
bi_p = stats.binom.pmf(k,n,p)
A. What is the probability that all three orders will be recognised correctly?
print("Probability that all three orders will be recognised correctly \
is %.6f"%(bi_p[3]))
Probability that all three orders will be recognised correctly is 0.653972
B. What is the probability that none of the three orders will be recognised correctly?
print("Probability that none of the three orders\n\
will be recognised correctly is %.6f"%(bi_p[0]))
Probability that none of the three orders will be recognised correctly is 0.002300
C. What is the probability that at least two of the three orders will be recognised correctly?
print("Probability that at least two of the three orders\n\
will be recognised correctly is %.6f"%(bi_p[2:].sum()))
Probability that at least two of the three orders will be recognised correctly is 0.952328
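These SciPy values can be verified from first principles with the binomial formula P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). A sketch for the three-order case:

```python
from math import comb
from scipy import stats

n, p = 3, 0.868  # three independent orders, 86.8% recognition accuracy

def binom_pmf(k, n, p):
    """Binomial mass function computed from first principles."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_all = binom_pmf(3, n, p)                       # all three recognised: p^3
p_none = binom_pmf(0, n, p)                      # none recognised: (1-p)^3
p_ge2 = binom_pmf(2, n, p) + binom_pmf(3, n, p)  # at least two recognised

# agrees with scipy.stats.binom to floating-point precision
assert abs(p_all - stats.binom.pmf(3, n, p)) < 1e-12

print(round(p_all, 6), round(p_none, 6), round(p_ge2, 6))  # 0.653972 0.0023 0.952328
```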
5. Question: A group of 300 professionals sat for a competitive exam. The results show the information of marks obtained by them have a mean of 60 and a standard deviation of 12. The pattern of marks follows a normal distribution. Answer the following questions.
mean = 60
std = 12
# lets build a pdf & cdf for scores from 1 to 100
k = np.arange(1,100.1,0.1)
n_p = stats.norm.pdf(k,mean,std)
n_c = stats.norm.cdf(k,mean,std)
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.plot(k,n_p)
plt.title("PDF")
plt.subplot(1,2,2)
plt.plot(k,n_c)
plt.title("CDF");
A. What is the percentage of students who score more than 80?
cn = pd.DataFrame(data=[k,n_c],index=['score','cdf']).T
# % of students with score more than 80
a=1-cn.loc[cn['score']>=80,'cdf'].iloc[0]
print("Percentage of students who score more than 80 \
is %.2f%%"%(a*100))
Percentage of students who score more than 80 is 4.78%
B. What is the percentage of students who score less than 50?
# % of students with score less than 50
b=cn.loc[cn['score']>=50,'cdf'].iloc[0]
print("Percentage of students who score less than 50 \
is %.2f%%"%(b*100))
Percentage of students who score less than 50 is 20.23%
C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
# score above 90% of population i.e. top 10% students
c=cn.loc[cn['cdf']>=0.9,'score'].iloc[0]
print("distinction mark for highest 10%% students \
is %.2f"%(c))
distinction mark for highest 10% students is 75.40
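Instead of scanning a 0.1-step grid, the three answers can be computed exactly with SciPy's `sf`, `cdf` and `ppf` (inverse CDF); the grid lookup is only accurate to the step size. A sketch:

```python
from scipy import stats

mean, std = 60, 12

pct_above_80 = stats.norm.sf(80, mean, std) * 100   # A. P(X > 80) as a percentage
pct_below_50 = stats.norm.cdf(50, mean, std) * 100  # B. P(X < 50) as a percentage
distinction = stats.norm.ppf(0.90, mean, std)       # C. 90th percentile mark

print(round(pct_above_80, 2), round(pct_below_50, 2), round(distinction, 2))  # 4.78 20.23 75.38
```

The exact distinction mark is about 75.38, slightly below the 75.40 returned by the grid scan.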
6. Question: Explain 1 real-life industry scenario [other than the ones mentioned above] where you can use the concepts learnt in this module of Applied Statistics to get a data-driven business solution.
Consider a use case where part supply versus shortage must be kept under control, and we want to know the probability that a particular vendor (say X) causes a part shortage n times in a given period.
(This would call for the Poisson probability function.)
Such a study will help risk management and strategic planning.
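A minimal sketch of this scenario, using an entirely hypothetical shortage rate of λ = 2 shortages per quarter for vendor X (in practice the rate would be estimated from historical supply records):

```python
from scipy import stats

# Hypothetical: vendor X has historically caused 2 part shortages per quarter
rate = 2.0

p_zero = stats.poisson.pmf(0, rate)  # P(no shortage next quarter)
p_ge3 = stats.poisson.sf(2, rate)    # P(3 or more shortages): a possible escalation threshold

print(round(p_zero, 4), round(p_ge3, 4))  # 0.1353 0.3233
```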
======================================================================================================
CONTEXT: Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.
OBJECTIVE: The company's management wants to invest in a proposal for managing some of the best teams in the league. The analytics department has been assigned the task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams they can approach that would be a deal win for them.
Steps and tasks:
# import dataset
bb = pd.read_csv("DS - Part2 - Basketball.csv")
# display samples
bb.head()
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
# Shape of the dataset
bb.shape
(61, 13)
There are 61 rows and 13 columns
# column names
bb.columns
Index(['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
The columns are described below:
| Sl | Column Name | Description |
|---|---|---|
| 1. | Team | Team’s name |
| 2. | Tournament | Number of played tournaments. |
| 3. | Score | Team’s score so far. |
| 4. | PlayedGames | Games played by the team so far. |
| 5. | WonGames | Games won by the team so far. |
| 6. | DrawnGames | Games drawn by the team so far. |
| 7. | LostGames | Games lost by the team so far. |
| 8. | BasketScored | Basket scored by the team so far. |
| 9. | BasketGiven | Basket scored against the team so far. |
| 10. | TournamentChampion | How many times the team was a champion of the tournaments so far. |
| 11. | Runner-up | How many times the team was a runners-up of the tournaments so far. |
| 12. | TeamLaunch | Year the team was launched on professional basketball. |
| 13. | HighestPositionHeld | Highest position held by the team amongst all the tournaments played. |
# quantitative data description
bb.describe()
| | Tournament | HighestPositionHeld |
|---|---|---|
| count | 61.000000 | 61.000000 |
| mean | 24.000000 | 7.081967 |
| std | 26.827225 | 5.276663 |
| min | 1.000000 | 1.000000 |
| 25% | 4.000000 | 3.000000 |
| 50% | 12.000000 | 6.000000 |
| 75% | 38.000000 | 10.000000 |
| max | 86.000000 | 20.000000 |
# datatype & missing data overview
bb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64
 2   Score                61 non-null     object
 3   PlayedGames          61 non-null     object
 4   WonGames             61 non-null     object
 5   DrawnGames           61 non-null     object
 6   LostGames            61 non-null     object
 7   BasketScored         61 non-null     object
 8   BasketGiven          61 non-null     object
 9   TournamentChampion   61 non-null     object
 10  Runner-up            61 non-null     object
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64
dtypes: int64(2), object(11)
memory usage: 6.3+ KB
bb.columns[1:]
Index(['Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
It is seen that many columns are wrongly represented with the object datatype.
There appear to be no nulls.
# convert elements to int
for i in bb.columns[1:]:
    for j in bb.index:
        try: # errors are expected for invalid entries, hence exception handling
            bb.loc[j,i]=int(bb.loc[j,i])
        except ValueError:
            print(i,j,bb.loc[j,i],sep=' : ',end='\n')
Score : 60 : -
PlayedGames : 60 : -
WonGames : 60 : -
DrawnGames : 60 : -
LostGames : 60 : -
BasketScored : 60 : -
BasketGiven : 60 : -
TournamentChampion : 6 : -
TournamentChampion : 8 : -
TournamentChampion : 11 : -
...
TournamentChampion : 60 : -
Runner-up : 6 : -
Runner-up : 9 : -
...
Runner-up : 60 : -
TeamLaunch : 3 : 1931to32
TeamLaunch : 5 : 1934-35
TeamLaunch : 8 : 1939-40
...
TeamLaunch : 57 : 1956-57
TeamLaunch : 58 : 1951~52
TeamLaunch : 59 : 1955-56
TeamLaunch : 60 : 2017~18
Type casting raises errors where non-numerical values are present:
A. The major invalid value is "-"
B. The TeamLaunch column has mixed entries which have to be processed further
C. The last entry, at index 60, has the maximum number of "-" values
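As an alternative to the element-wise try/except loop, pandas can coerce a whole column at once: `pd.to_numeric(..., errors='coerce')` turns invalid entries into NaN, which can then be located in a single vectorised pass. A sketch on a toy column (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy column mimicking the dataset's mixed entries (illustrative values)
score = pd.Series(["86", "80", "-", "82"], name="Score")

# Coerce: valid numbers convert, invalid entries become NaN
numeric = pd.to_numeric(score, errors="coerce")

# Locate the offending rows in one vectorised pass
bad_rows = score[numeric.isna()]
print(bad_rows.tolist())  # ['-']
```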
# count invalid hyphens in each row
bb['hyphen']=0.0
for i in bb.index:
    try:
        bb.loc[i,'hyphen']=bb.iloc[i].value_counts()['-']
    except KeyError: # row contains no '-'
        pass
bb.hyphen.value_counts()
2.0    46
0.0     8
1.0     6
9.0     1
Name: hyphen, dtype: int64
bb.loc[bb.hyphen>2].index[0]
60
As noted above, there are several "-" values in the row at index 60.
Let us drop that row, as the team has never played a game, leaving us with no clue about their performance / capabilities.
# drop 60th row
bb=bb.drop(index=60)
bb.shape
(60, 14)
Let us impute the remaining "-" values in the "TournamentChampion" and "Runner-up" columns
# assign zeros where the team has never been champion or runner-up
bb.loc[bb["TournamentChampion"]=="-","TournamentChampion"]=0
bb.loc[bb["Runner-up"]=="-","Runner-up"]=0
bb=bb.drop(columns="hyphen")
# review the Team Launch column
bb.TeamLaunch.to_frame().T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TeamLaunch | 1929 | 1929 | 1929 | 1931to32 | 1929 | 1934-35 | 1929 | 1929 | 1939-40 | 1932-33 | ... | 1953-54 | 1929 | 1979-80 | 1929 | 1950-51 | 201617 | 2009-10 | 1956-57 | 1951~52 | 1955-56 |
1 rows × 60 columns
The TeamLaunch column contains the year of launch, either as a Gregorian calendar year (YYYY) or as a financial/academic-style year range (YYYY-YY).
Hence let us extract the launch year in the Gregorian YYYY format alone.
from datetime import date as dt
# converting 'to' to '-'
for i in bb.index:
    if type(bb.TeamLaunch.loc[i]) is str:
        bb.loc[i,"TeamLaunch"]=bb.TeamLaunch.loc[i].replace('to','-')
# extracting sets of 4 digits, delimiter '-' or '~' and 2 digits
tl=bb.TeamLaunch.str.extract(r'(\d{4})?([-~])?(\d{2})')
for i in bb.index:
    j=bb.TeamLaunch.loc[i]
    if type(j) is str: # copy the extracted 4-digit year back to TeamLaunch
        bb.loc[i,"TeamLaunch"]=int(tl.loc[i,0])
    elif type(j) is int and j>9999: # extract the year from YYYYYY-formatted numbers
        bb.loc[i,"TeamLaunch"]=j//100
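Since every TeamLaunch format above begins with the four-digit Gregorian launch year (even the undelimited 201617 style), the same extraction could also be done in one vectorised pass with a regex anchored at the start of the string. A sketch on representative values:

```python
import pandas as pd

# Representative TeamLaunch formats seen in the dataset
launch = pd.Series(["1929", "1931to32", "1951~52", "1998-99", "201617"])

# Every format starts with the 4-digit Gregorian launch year
year = launch.str.extract(r"^(\d{4})")[0].astype(int)
print(year.tolist())  # [1929, 1931, 1951, 1998, 2016]
```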
Now let us convert all the columns (except Team) into integer types
for i in bb.columns[1:]:
    try:
        bb[i]=bb[i].astype(int)
    except ValueError:
        print("error in",i)
All columns converted without error
Rearrange the dataset, set the Team name as the index,
and check the info & description.
bb=bb[['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld']]
bb=bb.set_index('Team')
bb.info()
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, Team 1 to Team 60
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Tournament           60 non-null     int32
 1   Score                60 non-null     int32
 2   PlayedGames          60 non-null     int32
 3   WonGames             60 non-null     int32
 4   DrawnGames           60 non-null     int32
 5   LostGames            60 non-null     int32
 6   BasketScored         60 non-null     int32
 7   BasketGiven          60 non-null     int32
 8   TournamentChampion   60 non-null     int32
 9   Runner-up            60 non-null     int32
 10  TeamLaunch           60 non-null     int32
 11  HighestPositionHeld  60 non-null     int32
dtypes: int32(12)
memory usage: 3.3+ KB
All invalid data points have been cleaned and the dataset is ready for analysis
bb.describe()
| | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.00000 | 60.000000 | 60.000000 | 60.000000 |
| mean | 24.383333 | 916.450000 | 810.100000 | 309.033333 | 192.083333 | 308.816667 | 1159.350000 | 1159.233333 | 1.45000 | 1.433333 | 1957.950000 | 7.050000 |
| std | 26.884620 | 1138.342899 | 877.465393 | 408.481395 | 201.985508 | 294.508639 | 1512.063948 | 1163.946914 | 5.51554 | 4.574679 | 26.646732 | 5.315232 |
| min | 1.000000 | 14.000000 | 30.000000 | 5.000000 | 4.000000 | 15.000000 | 34.000000 | 55.000000 | 0.00000 | 0.000000 | 1929.000000 | 1.000000 |
| 25% | 4.000000 | 104.250000 | 115.500000 | 34.750000 | 26.250000 | 62.750000 | 154.500000 | 236.000000 | 0.00000 | 0.000000 | 1934.750000 | 3.000000 |
| 50% | 12.000000 | 395.500000 | 424.500000 | 124.000000 | 98.500000 | 197.500000 | 444.000000 | 632.500000 | 0.00000 | 0.000000 | 1950.500000 | 6.000000 |
| 75% | 39.000000 | 1360.500000 | 1345.500000 | 432.750000 | 331.500000 | 563.500000 | 1669.750000 | 2001.250000 | 0.00000 | 0.000000 | 1977.250000 | 10.000000 |
| max | 86.000000 | 4385.000000 | 2762.000000 | 1647.000000 | 633.000000 | 1070.000000 | 5947.000000 | 3889.000000 | 33.00000 | 25.000000 | 2016.000000 | 20.000000 |
The attributes of the data vary widely in scale, ranging from tens to thousands
from plotly import express as px
fig = px.box(bb, y=bb.columns)
fig.show()
Though it is hard to visualise the distributions of individual attributes due to the huge differences in scale,
we are able to find significant outliers in 3 columns.
The outliers in Score, WonGames and BasketScored cannot be excluded from the data points, as those exceptional performances are what define the top teams
bb.hist(layout=(3,4),figsize=(15,10));
The attributes do not follow a normal distribution,
probably because various generations of teams are being compared here (TeamLaunch ranges from 1929 to 2016).
Let us study the attribute-wise distributions to get a better picture
from plotly.subplots import make_subplots
import plotly.graph_objects as go
fig = make_subplots(rows=2,cols=1)
fig.update_layout(
updatemenus=[
dict(
active=0,
buttons=list([
dict(label="All",method="update",
args=[{"visible":[True,True,True,True,True,True,True,True,True,True,True,True,
True,True,True,True,True,True,True,True,True,True,True,True]},
{"title":"All data points shown, select from list -->"
+'<br>'+"Inferences will be written alongside each plot",
"annotations":[]}]),
dict(label="Tournament",method="update",
args=[{"visible":[True,True,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"Tournament",
"annotations":[dict(text="Number of tournaments played is Right Skewed",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="24 of the teams have played less than 8 tournaments",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="Score",method="update",
args=[{"visible":[False,False,True,True,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"Score",
"annotations":[dict(text="5 outstanding (outlier) teams with scores more than 2819",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="more than 50% of teams (32 teams) have scored less than 430",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="PlayedGames",method="update",
args=[{"visible":[False,False,False,False,True,True,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"PlayedGames",
"annotations":[dict(text="number of games played is Right Skewed (similar to tournaments)",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="24 of the teams have played less than 275 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="WonGames",method="update",
args=[{"visible":[False,False,False,False,False,False,True,True,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"WonGames",
"annotations":[dict(text="5 outstanding (outlier) teams have won more than 990 games",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="more than 50% of teams (35 teams) have won less than 165 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="DrawnGames",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,True,True,False,False,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"DrawnGames",
"annotations":[dict(text="number of games drawn is also Right Skewed",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="again, 23 of the teams have drawn less than 61 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="LostGames",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,True,True,
False,False,False,False,False,False,False,False,False,False,False,False]},
{"title":"LostGames",
"annotations":[dict(text="number of games lost is also Right Skewed",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="22 of the teams have lost upto 105 games",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="BasketScored",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
True,True,False,False,False,False,False,False,False,False,False,False]},
{"title":"BasketScored",
"annotations":[dict(text="5 outstanding (outlier) teams have scored more than 3680 baskets",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="more than 50% of teams (33 teams) have scored less than 600 baskets",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="BasketGiven",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,True,True,False,False,False,False,False,False,False,False]},
{"title":"BasketGiven",
"annotations":[dict(text="the right skew shows that all teams have conceded a significant number of baskets, indicating high competition",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="23 of the teams have given less than 380 baskets",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="TournamentChampion",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,True,True,False,False,False,False,False,False]},
{"title":"TournamentChampion",
"annotations":[dict(text="less than 10 teams had been champions",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="the fences merge because the majority of teams have never been champions",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="Runner-up",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,True,True,False,False,False,False]},
{"title":"Runner-up",
"annotations":[dict(text="a little more than 10 teams have been runners-up",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="the fences merge because the majority of teams have never been runners-up",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="TeamLaunch",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,True,True,False,False]},
{"title":"TeamLaunch",
"annotations":[dict(text="teams of widely varying ages are included in the dataset, with a significant number formed much earlier",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="up to 11 teams were formed before 1931; new team formation peaked around 1945 and again around 1995",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
dict(label="HighestPositionHeld",method="update",
args=[{"visible":[False,False,False,False,False,False,False,False,False,False,False,False,
False,False,False,False,False,False,False,False,False,False,True,True]},
{"title":"HighestPositionHeld",
"annotations":[dict(text="a right skew means that quite a large number of teams have secured better positions (a lower positional value means a better rank)",
xref='paper',yref='paper',x=0,y=1.05,align='left',showarrow=False),
dict(text="18 teams have secured a top-3 position at their best",
xref='paper',yref='paper',x=0,y=0.45,align='left',showarrow=False)]}]),
]),
direction="down",
showactive=True,
x=0.8,
xanchor="left",
y=1.2,
yanchor="top"
)
]
)
fig.update_layout(showlegend=False,title="All data points shown, select from list -->"
+'<br>'+"Inferences will be written alongside each plot")
for i in bb.columns:
    fig.add_trace(go.Box(x=bb[i],name='',hovertemplate='%{x}',jitter=1),1,1)
    fig.add_trace(go.Histogram(x=bb[i],name=i,xbins_size=(bb[i].max()-bb[i].min())/10),2,1)
fig.show()
With most of the attributes being right skewed and none following a normal distribution,
it will be difficult to determine the better performing teams from the raw totals alone
import seaborn as sns
sns.pairplot(bb); # scatter plot between all variable combinations, depicting any possible relations
Quite a lot of attributes are found to be related, either positively or inversely.
Let us review the correlation coefficients to measure these relationships
plt.figure(figsize=(14,10))
sns.heatmap(bb.corr(),annot=True,vmax=1,vmin=-1,cmap='RdYlGn');
# derived features normalising for team age and games played
bb["TeamAge"]=2021-bb.TeamLaunch              # years since launch
bb["Level"]=21-bb.HighestPositionHeld         # higher value = better best-ever rank
bb["WinRatio"]=bb.WonGames.div(bb.PlayedGames)
bb["LoseRatio"]=bb.LostGames.div(bb.PlayedGames)
bb["DrawRatio"]=bb.DrawnGames.div(bb.PlayedGames)
bb["PlayRate"]=bb.PlayedGames.div(bb.TeamAge) # games played per year of existence
newbb=bb[['PlayedGames','WinRatio','DrawRatio','LoseRatio','BasketGiven','TournamentChampion','Runner-up',"TeamAge","Level","PlayRate"]]
newbb.head()
| | PlayedGames | WinRatio | DrawRatio | LoseRatio | BasketGiven | TournamentChampion | Runner-up | TeamAge | Level | PlayRate |
|---|---|---|---|---|---|---|---|---|---|---|
| Team | ||||||||||
| Team 1 | 2762 | 0.596307 | 0.199855 | 0.203838 | 3140 | 33 | 23 | 92 | 20 | 30.021739 |
| Team 2 | 2762 | 0.572411 | 0.207458 | 0.220130 | 3114 | 25 | 25 | 92 | 20 | 30.021739 |
| Team 3 | 2614 | 0.474751 | 0.228768 | 0.296480 | 3309 | 10 | 8 | 92 | 20 | 28.413043 |
| Team 4 | 2664 | 0.445571 | 0.231231 | 0.323198 | 3469 | 6 | 6 | 90 | 20 | 29.600000 |
| Team 5 | 2762 | 0.437726 | 0.229182 | 0.333092 | 3700 | 8 | 7 | 92 | 20 | 30.021739 |
Let us study the relationships using the refined attributes of the set of teams
plt.figure(figsize=(14,10))
sns.heatmap(newbb.corr(),annot=True,vmax=1,vmin=-1,cmap='RdYlGn');
Now, the refined attributes describe the group of teams more accurately
Having arrived at meaningful qualities of the group of teams,
one may intuitively choose teams with a high WinRatio to invest in
So let us see whether that choice is worthy of investment
fig=px.scatter(hover_data=[newbb.index],data_frame=newbb,x='TeamAge',y='WinRatio',
size=newbb.Level,color='PlayRate',color_continuous_scale='Turbo')
fig.add_annotation(x=23, y=0.44,text="Team21",showarrow=False)
fig.add_annotation(x=17, y=0.35,text="Team25",showarrow=False)
Interestingly, yes: the teams with high WinRatios (Teams 1 to 5) have been TournamentChampions several times
But those are the oldest teams in the group, and are likely to already be under contract with competitors
So who are we left with? Only the young teams!
Surprisingly, among teams not older than 25 years, there are two budding performers with high perseverance:
Teams 21 & 25 have shown a high interest in playing frequently
(compared against teams of age <= 25)
Compare2 = pd.DataFrame(columns=newbb.columns)
Compare2.loc["Mean",:]=newbb.loc[bb.TeamAge<=25].mean()
Compare2.loc["Std",:]=newbb.loc[bb.TeamAge<=25].std()
Compare2.loc["Team 21",:]=newbb.loc["Team 21",:]
Compare2.loc["Team 25",:]=newbb.loc["Team 25",:]
Compare2
| PlayedGames | WinRatio | DrawRatio | LoseRatio | BasketGiven | TournamentChampion | Runner-up | TeamAge | Level | PlayRate | |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 219.0 | 0.278444 | 0.259735 | 0.461821 | 305.75 | 0.0 | 0.125 | 15.625 | 8.875 | 13.544709 |
| Std | 220.044151 | 0.067426 | 0.020349 | 0.064536 | 273.148181 | 0.0 | 0.353553 | 7.443837 | 6.490377 | 9.977962 |
| Team 21 | 646.0 | 0.411765 | 0.266254 | 0.321981 | 789.0 | 0.0 | 1.0 | 23.0 | 19.0 | 28.086957 |
| Team 25 | 456.0 | 0.322368 | 0.245614 | 0.432018 | 633.0 | 0.0 | 0.0 | 17.0 | 15.0 | 26.823529 |
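To quantify how far Teams 21 & 25 stand out from their peer group, a quick z-score check can help. A minimal sketch using the Mean/Std/WinRatio values printed in the comparison table above (values copied by hand, so treat this as illustrative):

```python
# z-score sketch using the Mean/Std rows of the comparison table above
group_mean = 0.278444  # mean WinRatio of teams aged <= 25
group_std = 0.067426   # std of WinRatio for the same group

def z_score(value, mean, std):
    """How many standard deviations a value lies above the group mean."""
    return (value - mean) / std

z_team21 = z_score(0.411765, group_mean, group_std)
z_team25 = z_score(0.322368, group_mean, group_std)
print("Team 21 WinRatio z-score: %.2f" % z_team21)  # ~2 std above the mean
print("Team 25 WinRatio z-score: %.2f" % z_team25)
```

Team 21's WinRatio sits roughly two standard deviations above its peer group, supporting the "budding performer" reading.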
======================================================================================================
CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company specifically reports on the business of technology: news, analysis of emerging trends and profiling of new tech businesses and products. Its event, Startup Battlefield, is the world's pre-eminent startup competition. Startup Battlefield features 15-30 top early-stage startups pitching to top judges in front of a vast live audience, present in person and online.
OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.
Steps and tasks:
# read file
prod=pd.read_csv("DS - Part3 - CompanyX_EU.csv")
prod.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
# shape & size of the dataset
prod.shape
(662, 6)
ATTRIBUTE INFORMATION
Each row in the dataset is a Start-up company and the columns describe the company
| Column | Description |
|---|---|
| Startup | Name of the company |
| Product | Actual product |
| Funding | Funds raised by the company in USD |
| Event | The event the company participated in |
| Result | Described by Contestant, Finalist, Audience choice, Winner or Runner up |
| OperatingState | Current status of the company, Operating ,Closed, Acquired or IPO |
prod.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 662 entries, 0 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 662 non-null object 1 Product 656 non-null object 2 Funding 448 non-null object 3 Event 662 non-null object 4 Result 662 non-null object 5 OperatingState 662 non-null object dtypes: object(6) memory usage: 31.2+ KB
All the attributes are found to be of object datatype. Going forward, the Funding column must be considered for conversion to a numeric type
prod.isna().value_counts().to_frame()
| 0 | ||||||
|---|---|---|---|---|---|---|
| Startup | Product | Funding | Event | Result | OperatingState | |
| False | False | False | False | False | False | 446 |
| True | False | False | False | 210 | ||
| True | True | False | False | False | 4 | |
| False | False | False | False | 2 |
There are a total of 210 + 4 + 2 = 216 records with nulls or NaNs
prod.dropna(axis=0,how='any',inplace=True)
prod.isna().value_counts().to_frame()
| 0 | ||||||
|---|---|---|---|---|---|---|
| Startup | Product | Funding | Event | Result | OperatingState | |
| False | False | False | False | False | False | 446 |
# preview
prod.Funding.head()
1 $630K 3 $1M 5 $19.3M 6 $29M 10 $1.8M Name: Funding, dtype: object
# check for variety of currency units
prod.Funding.apply(lambda x: x[0]).value_counts()
$ 446 Name: Funding, dtype: int64
# only Dollars found
# check for variety of multiplier units
prod.Funding.apply(lambda x: x[-1]).value_counts()
M 309 K 136 B 1 Name: Funding, dtype: int64
# three different multipliers found
# check for negative sign in 2nd position
prod.Funding.apply(lambda x: x[1]).value_counts()
1 149 2 81 3 60 5 35 6 33 4 30 7 25 8 19 9 14 Name: Funding, dtype: int64
# no negative numbers found
import re
wildcard=r"""([$])(\d*\.*\d*)([KMB])"""
# extract components of the string
funds=prod.Funding.apply(lambda x:re.findall(wildcard,x))
# type casting to float & apply MILLION conversion
prod.Funding=funds.apply(lambda x:float(x[0][1])/1000 if x[0][2]=='K' else
float(x[0][1]) if x[0][2]=='M' else
float(x[0][1])*1000)
# review conversion to MILLION units
prod.Funding.head()
1 0.63 3 1.00 5 19.30 6 29.00 10 1.80 Name: Funding, dtype: float64
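The regex conversion above can also be packaged as a small reusable helper. `parse_funding` is a hypothetical name; the logic mirrors the notebook's $/K/M/B handling, normalising everything to millions:

```python
import re

def parse_funding(s):
    """Convert a funding string like '$630K' or '$1.5M' to millions of USD."""
    match = re.fullmatch(r"\$(\d*\.?\d*)([KMB])", s)
    if match is None:
        raise ValueError("unrecognised funding string: %r" % s)
    value, unit = float(match.group(1)), match.group(2)
    scale = {"K": 1e-3, "M": 1.0, "B": 1e3}[unit]  # normalise to millions
    return value * scale

print(parse_funding("$630K"))  # 0.63
print(parse_funding("$1M"))    # 1.0
print(parse_funding("$1.7B"))  # 1700.0
```

Raising on unrecognised strings (rather than silently returning NaN) makes any new currency or multiplier symbol in future data immediately visible.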
fig=go.Figure()
fig.add_trace(go.Box(x=prod.Funding,jitter=0,name='Funding',boxpoints='all'))
fig.add_annotation(x=400, y=-0.45,showarrow=False,
text="the data is heavily right skewed"+'<br>'+
"giving huge number of outliers")
# outlier summary
q1=prod["Funding"].quantile(q=0.25,interpolation='nearest') #quantiles
q3=prod["Funding"].quantile(q=0.75,interpolation='nearest')
IQR=q3-q1
lf=q1-(1.5*IQR)
lower_fence=np.sort(prod.loc[prod["Funding"]>=lf,"Funding"])[0] #whiskers
uf=q3+(1.5*IQR)
upper_fence=np.sort(prod.loc[prod["Funding"]<=uf,"Funding"])[-1]
print("Outlier count:",prod.loc[prod["Funding"]>upper_fence].shape[0]) #number of outliers
print("Total Records:",prod.shape[0])
Outlier count: 60 Total Records: 446
The above graph indicates heavy skewness in the data, flagging as many as 60 Funding values as outliers.
Compared with the sample size of just 446, the outlier count amounts to 13.45%.
Labeling more than 10% of the available sample as outliers and excluding them from further analysis would greatly distort the sample distribution.
Hence, let us try transforming the Funding data to obtain better clarity.
Experiment further: try to rescale the Funding information.
prod["log(Fund)"]=pd.DataFrame(np.log(prod.Funding*1000)) #log scaling
# multiplied by 1000 to convert to K, so as to avoid negative log
fig=go.Figure()
fig.add_trace(go.Box(x=prod["log(Fund)"],jitter=1,name='log(Fund)',boxpoints='all'))
fig.add_annotation(x=7.5, y=0.5,showarrow=False,
text="BOX PLOT for FUNDS in log scale <br>log transform reduces the right skewedness")
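The skew reduction seen above can be illustrated on synthetic lognormal values. This sketch uses made-up data, not the Funding column itself:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
raw = rng.lognormal(mean=1.0, sigma=1.2, size=1000)  # heavily right skewed

print("skew before log: %.2f" % skew(raw))          # strongly positive
print("skew after log:  %.2f" % skew(np.log(raw)))  # close to 0
```

Because lognormal data is, by construction, normal after a log transform, the log scale removes nearly all of the right skew, which is the same behaviour observed in the Funding box plot.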
# plotly doesn't expose the computed fence values
# hence let us arrive at them mathematically
# (they exactly match the plot data)
q1=prod["log(Fund)"].quantile(q=0.25,interpolation='nearest')
q3=prod["log(Fund)"].quantile(q=0.75,interpolation='nearest')
IQR=q3-q1
lf=q1-(1.5*IQR)
lower_fence=np.sort(prod.loc[prod["log(Fund)"]>=lf,"log(Fund)"])[0]
uf=q3+(1.5*IQR)
upper_fence=np.sort(prod.loc[prod["log(Fund)"]<=uf,"log(Fund)"])[-1]
print("lower fence %.5f\n\
upper fence %.5f"%(lower_fence,upper_fence))
lower fence 2.88480 upper fence 12.71409
print("Number of datapoints above upper fence are %d"
%(prod.loc[prod["log(Fund)"]>upper_fence].shape[0]))
print("Number of datapoints below lower fence are %d"
%(prod.loc[prod["log(Fund)"]<lower_fence].shape[0]))
Number of datapoints above upper fence are 2 Number of datapoints below lower fence are 1
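Since the fence computation appears twice in this notebook, it can be factored into a small helper. `iqr_fences` is a hypothetical name, and this sketch uses plain linear-interpolation percentiles rather than `interpolation='nearest'`:

```python
import numpy as np

def iqr_fences(values):
    """Return (lower_fence, upper_fence) = Q1 - 1.5*IQR, Q3 + 1.5*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
lf, uf = iqr_fences(data)
outliers = [v for v in data if v < lf or v > uf]
print(lf, uf, outliers)  # -4.0 16.0 [100]
```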
# let us drop all the outliers, lower & upper outliers
# 1. visualise the outliers
display(prod.loc[prod["log(Fund)"]>upper_fence])
display(prod.loc[prod["log(Fund)"]<lower_fence])
| Startup | Product | Funding | Event | Result | OperatingState | log(Fund) | |
|---|---|---|---|---|---|---|---|
| 154 | Dropbox | dropbox.com | 1700.0 | TC50 2008 | Contestant | Operating | 14.346139 |
| 656 | Zenefits | zenefits.com | 583.6 | Disrupt NYC 2013 | Finalist | Operating | 13.276971 |
| Startup | Product | Funding | Event | Result | OperatingState | log(Fund) | |
|---|---|---|---|---|---|---|---|
| 198 | FoodStantly | foodstantly.com | 0.005 | Disrupt London 2016 | Contestant | Operating | 1.609438 |
# 2. drop the outliers
prod.drop(prod.loc[prod["log(Fund)"]>upper_fence].index, inplace=True)
prod.drop(prod.loc[prod["log(Fund)"]<lower_fence].index,inplace=True)
fig=go.Figure()
fig.add_trace(go.Box(x=prod["log(Fund)"],jitter=1,name='log(Fund)',boxpoints='all'))
fig.add_annotation(x=7.5, y=0.5,showarrow=False,
text="BOX PLOT for FUNDS in log scale <br>all datapoints are placed within the whiskers")
prod.OperatingState.value_counts()
Operating 316 Acquired 66 Closed 57 Ipo 4 Name: OperatingState, dtype: int64
Out of 443 companies under study, 386 are functional (Operating, Acquired or IPO) & 57 have been closed
fig=go.Figure()
fig.add_trace(go.Histogram(name='OperatingState',x=prod.OperatingState))
fig=make_subplots(2,1)
fig.add_trace(go.Histogram(name='Funds in Millions',x=prod.Funding),1,1)
fig.add_trace(go.Histogram(name='Funds in log scale',x=prod["log(Fund)"]),2,1)
fig.update_layout(showlegend=False)
fig.add_annotation(x=0,y=1.05,xref='paper',yref='paper',showarrow=False,
text="Funds in Millions")
fig.add_annotation(x=0,y=0.45,xref='paper',yref='paper',showarrow=False,
text="Funds in log scale")
Notably, the log transform has resulted in an approximately normal distribution
fig=make_subplots(2,2)
fig.add_trace(go.Histogram(name="Operating(Mil)",x=prod.loc[prod.OperatingState == 'Operating',"Funding"]),1,1)
fig.add_trace(go.Histogram(name="Closed(Mil)",x=prod.loc[prod.OperatingState == 'Closed',"Funding"]),1,2)
fig.add_trace(go.Histogram(name="Operating(log)",x=prod.loc[prod.OperatingState == 'Operating',"log(Fund)"]),2,1)
fig.add_trace(go.Histogram(name="Closed(log)",x=prod.loc[prod.OperatingState == 'Closed',"log(Fund)"]),2,2)
Based on the above plots, the Funding distributions of Operating and Closed companies appear to differ
# lets describe the distribution
prod[["Funding","OperatingState"]].groupby(by="OperatingState").describe().T
| OperatingState | Acquired | Closed | Ipo | Operating | |
|---|---|---|---|---|---|
| Funding | count | 66.000000 | 57.000000 | 4.000000 | 316.000000 |
| mean | 13.212894 | 3.258170 | 137.775000 | 12.016558 | |
| std | 22.033230 | 5.923294 | 134.179988 | 32.077520 | |
| min | 0.020000 | 0.093000 | 34.900000 | 0.017900 | |
| 25% | 1.225000 | 0.475000 | 58.225000 | 0.745750 | |
| 50% | 5.800000 | 0.929200 | 91.900000 | 2.200000 | |
| 75% | 16.275000 | 3.500000 | 171.450000 | 8.125000 | |
| max | 142.000000 | 35.500000 | 332.400000 | 278.000000 |
The above description suggests that the mean and spread of Funding vary significantly between Operating & Closed companies
But the influence of skewness could cast ambiguity over this inference
Hence, let us review the same in the log-transformed data
desc=prod[["log(Fund)","OperatingState"]].groupby(by="OperatingState").describe().T
desc
| OperatingState | Acquired | Closed | Ipo | Operating | |
|---|---|---|---|---|---|
| log(Fund) | count | 66.000000 | 57.000000 | 4.000000 | 316.000000 |
| mean | 8.212889 | 7.062826 | 11.487123 | 7.731426 | |
| std | 2.030676 | 1.427570 | 0.957040 | 1.928969 | |
| min | 2.995732 | 4.532599 | 10.460242 | 2.884801 | |
| 25% | 7.110088 | 6.163315 | 10.938118 | 6.614390 | |
| 50% | 8.661884 | 6.834324 | 11.387077 | 7.696213 | |
| 75% | 9.697093 | 8.160518 | 11.936081 | 9.002687 | |
| max | 11.863582 | 10.477288 | 12.714094 | 12.535376 |
print("mean of Funds to Operating Companies: $%.2f millions\n\
with standard deviation: $%.3f millions\n"
%(np.e**desc.Operating["log(Fund)","mean"]/1000,
np.e**desc.Operating["log(Fund)","std"]/1000))
print("mean of Funds to Closed Companies: $%.2f millions\n\
with standard deviation: $%.3f millions"
%(np.e**desc.Closed["log(Fund)","mean"]/1000,
np.e**desc.Closed["log(Fund)","std"]/1000))
mean of Funds to Operating Companies: $2.28 millions with standard deviation: $0.007 millions mean of Funds to Closed Companies: $1.17 millions with standard deviation: $0.004 millions
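A caveat on the back-transformation above: exponentiating the mean of log-values yields the geometric mean, not the arithmetic mean (and `e**std` is a dimensionless multiplicative spread factor rather than a dollar amount). A quick sketch with made-up numbers:

```python
import numpy as np
from scipy.stats import gmean

x = np.array([0.5, 2.0, 8.0, 32.0])  # made-up funding values in millions

back_transformed = np.exp(np.mean(np.log(x)))  # what np.e**mean(log x) computes
print(back_transformed)  # 4.0 (the geometric mean)
print(gmean(x))          # 4.0, same figure
print(x.mean())          # 10.625, the arithmetic mean is larger
```

So the $2.28M and $1.17M figures are geometric means; they are valid for comparing the two groups but will be smaller than the arithmetic means seen in the raw Funding table.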
The previous inference is supported by the log-transformed data as well:
funds allocated to Closed companies were far lower than those allocated to successfully Operating companies
Let us also verify the same using a 2-sample t-test
Null Hypothesis Ho
Funds allocated to either class of companies are similar
Alternate Hypothesis Ha
Funds allocated to Operating companies differ significantly from those allocated to Closed companies
from scipy.stats import ttest_ind
s1=prod.loc[prod.OperatingState=="Operating","Funding"]
s2=prod.loc[prod.OperatingState=="Closed","Funding"]
alpha=0.05 # critical
t_res=ttest_ind(s1,s2)
if t_res.pvalue<=alpha:
print("The pvalue %.3f being within alpha %.2f,\n\
the samples are significantly different,\n\
thus we reject the Null Hypothesis Ho"%(t_res.pvalue,alpha))
else:
print("The pvalue %.3f being above alpha %.2f,\n\
the samples are not significantly different,\n\
thus we fail to reject the Null Hypothesis Ho"%(t_res.pvalue,alpha))
The pvalue 0.041 being within alpha 0.05, the samples are significantly different, thus we reject the Null Hypothesis Ho
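Since the two groups have quite different variances (std of roughly 32 vs 5.9 in millions, per the describe table above) and unequal sizes, Welch's variant (`equal_var=False`) is a common robustness check. A sketch on synthetic samples standing in for the two funding groups:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# made-up stand-ins for the Operating and Closed funding samples
s1 = rng.lognormal(mean=1.0, sigma=1.5, size=316)
s2 = rng.lognormal(mean=0.3, sigma=1.2, size=57)

# Welch's t-test does not assume equal population variances
t_res = ttest_ind(s1, s2, equal_var=False)
print("t = %.3f, p = %.3f" % (t_res.statistic, t_res.pvalue))
```

If the Welch p-value agrees with the pooled-variance result, the conclusion is insensitive to the equal-variance assumption.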
Conclusion: the above test reiterates that the funds allocated to the two groups are not similar
df=pd.read_csv("DS - Part3 - CompanyX_EU.csv")
df.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
df.Result.value_counts()
Contestant 488 Finalist 84 Audience choice 41 Winner 26 Runner up 23 Name: Result, dtype: int64
fig=go.Figure()
fig.add_trace(go.Histogram(name='Result',x=df.Result))
Considering all recognised companies as winners, for the sake of analysis
conts=df.Result.value_counts().loc['Contestant']
wins=df.shape[0]-conts
op_conts=df.loc[df.Result=='Contestant',"OperatingState"].value_counts().loc["Operating"]
op_wins=df.loc[df.Result!='Contestant',"OperatingState"].value_counts().loc["Operating"]
print("Percentage of winners still operating: %.2f%%"%(op_wins/wins*100))
print("Percentage of contestants still operating: %.2f%%"%(op_conts/conts*100))
Percentage of winners still operating: 76.44% Percentage of contestants still operating: 68.03%
Z test of proportions
Null Hypothesis Ho
Proportion of Winner Companies & Contestant Companies are similar
Alternate Hypothesis Ha
Proportion of Winner Companies & Contestant Companies are significantly different
from statsmodels.stats.proportion import proportions_ztest
stat,p=proportions_ztest(count=[op_conts,op_wins],nobs=[conts,wins])
alpha=0.05 # critical
if p<=alpha:
print("The pvalue %.3f being within alpha %.2f,\n\
the samples are significantly different,\n\
thus we reject the Null Hypothesis Ho"%(p,alpha))
else:
print("The pvalue %.3f being above alpha %.2f,\n\
the samples are not significantly different,\n\
thus we fail to reject the Null Hypothesis Ho"%(p,alpha))
The pvalue 0.037 being within alpha 0.05, the samples are significantly different, thus we reject the Null Hypothesis Ho
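The statsmodels result can be cross-checked with the pooled two-proportion z formula. The counts below are implied by the printed percentages (about 332 of 488 contestants and 133 of 174 recognised companies still operating); they are derived figures, so treat them as an approximation:

```python
import numpy as np
from scipy.stats import norm

# counts implied by the percentages printed above (approximate)
x1, n1 = 332, 488   # contestants still operating
x2, n2 = 133, 174   # recognised companies still operating

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion
se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))  # pooled standard error
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))                        # two-sided p-value
print("z = %.3f, p = %.3f" % (z, p_value))  # p close to the 0.037 reported above
```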
Conclusion: companies recognised in the Startup Battlefield event have survived better than the remaining contestants
df.Event.value_counts().to_frame()
| Event | |
|---|---|
| TC50 2008 | 52 |
| TC50 2009 | 50 |
| TC40 2007 | 40 |
| Disrupt NYC 2011 | 32 |
| Disrupt NYC 2013 | 31 |
| Disrupt SF 2011 | 31 |
| Disrupt SF 2013 | 31 |
| Disrupt NYC 2012 | 30 |
| Disrupt SF 2012 | 30 |
| Disrupt SF 2014 | 28 |
| Disrupt NYC 2014 | 27 |
| Disrupt SF 2015 | 27 |
| Disrupt SF 2016 | 26 |
| Disrupt SF 2010 | 26 |
| Disrupt NY 2015 | 26 |
| Disrupt NYC 2010 | 22 |
| Disrupt NY 2016 | 22 |
| Disrupt Beijing 2011 | 18 |
| Disrupt EU 2013 | 15 |
| Hardware Battlefield 2015 | 15 |
| Disrupt EU 2014 | 15 |
| Disrupt London 2015 | 15 |
| Hardware Battlefield 2014 | 14 |
| Disrupt London 2016 | 13 |
| Hardware Battlefield 2016 | 13 |
| - | 13 |
fig=go.Figure()
fig.add_trace(go.Histogram(name='Event',x=df.Event))
TC50 2008 & 2009 saw the maximum numbers of contestants
flag1=df.Event.loc[df.Event!="-"].apply(lambda x:re.findall("Disrupt",x)==['Disrupt'])
flag2=df.Event.loc[df.Event!="-"].apply(lambda x:int(x[-4:])>=2013)
df.Event.loc[df.Event!="-"][flag1&flag2].value_counts()
Disrupt NYC 2013 31 Disrupt SF 2013 31 Disrupt SF 2014 28 Disrupt SF 2015 27 Disrupt NYC 2014 27 Disrupt SF 2016 26 Disrupt NY 2015 26 Disrupt NY 2016 22 Disrupt EU 2014 15 Disrupt EU 2013 15 Disrupt London 2015 15 Disrupt London 2016 13 Name: Event, dtype: int64
fig=go.Figure()
fig.add_trace(go.Histogram(name='Event',x=df.Event.loc[df.Event!="-"][flag1&flag2]))
#Lets consider preprocessed dataset for Funds analysis
prod.head()
| Startup | Product | Funding | Event | Result | OperatingState | log(Fund) | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | 0.63 | Disrupt NYC 2013 | Contestant | Closed | 6.445720 |
| 3 | 3Dprintler | 3dprintler.com | 1.00 | Disrupt NY 2016 | Audience choice | Operating | 6.907755 |
| 5 | 5to1 | 5to1.com | 19.30 | TC50 2009 | Contestant | Acquired | 9.867860 |
| 6 | 8 Securities | 8securities.com | 29.00 | Disrupt Beijing 2011 | Finalist | Operating | 10.275051 |
| 10 | AdhereTech | adheretech.com | 1.80 | Hardware Battlefield 2014 | Contestant | Operating | 7.495542 |
# flags to mark the event locations
# (masks are built on prod itself so their index aligns with the Funding series)
fny=prod.Event.loc[prod.Event!="-"].apply(lambda x:re.findall("NY",x)==['NY'])
fsf=prod.Event.loc[prod.Event!="-"].apply(lambda x:re.findall("SF",x)==['SF'])
feu=prod.Event.loc[prod.Event!="-"].apply(lambda x:re.findall("EU",x)==['EU'])
f13=prod.Event.loc[prod.Event!="-"].apply(lambda x:int(x[-4:])>=2013)
NY=prod.Funding.loc[prod.Event!="-"][fny&f13]
SF=prod.Funding.loc[prod.Event!="-"][fsf&f13]
EU=prod.Funding.loc[prod.Event!="-"][feu&f13]
One-way ANOVA test
Null Hypothesis Ho
Funds across the 3 locations are the same
Alternate Hypothesis Ha
Funds across the 3 locations are different
from scipy.stats import f_oneway
s,p=f_oneway(NY, SF, EU)
alpha=0.05 # critical
if p<=alpha:
print("The pvalue %.3f being within alpha %.2f,\n\
the samples are significantly different,\n\
thus we reject the Null Hypothesis Ho"%(p,alpha))
else:
print("The pvalue %.3f being above alpha %.2f,\n\
the samples are not significantly different,\n\
thus we fail to reject the Null Hypothesis Ho"%(p,alpha))
The pvalue 0.779 being above alpha 0.05, the samples are not significantly different, thus we fail to reject the Null Hypothesis Ho
Hence the distributions of funds across the 3 locations are similar
fig=go.Figure()
fig.add_trace(go.Histogram(name='NY',x=NY))
fig.add_trace(go.Histogram(name='SF',x=SF))
fig.add_trace(go.Histogram(name='EU',x=EU))
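Given the heavy right skew of the raw Funding values, a nonparametric Kruskal-Wallis test is a common complement to one-way ANOVA, since it compares rank distributions instead of means. A sketch on synthetic samples standing in for the three location groups:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(7)
# made-up stand-ins for the NY, SF and EU funding samples
ny = rng.lognormal(mean=1.0, sigma=1.3, size=130)
sf = rng.lognormal(mean=1.0, sigma=1.3, size=140)
eu = rng.lognormal(mean=1.0, sigma=1.3, size=55)

# Kruskal-Wallis is rank-based, so skewness and outliers matter far less
stat, p = kruskal(ny, sf, eu)
print("H = %.3f, p = %.3f" % (stat, p))
```

If both tests fail to reject, the "funds are similar across locations" conclusion does not hinge on the ANOVA normality assumption.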
Suggestions: collect details on the purpose of each funding round, so that companies can be filtered and compared more meaningfully.